Abstract: Applying deductive verification to formally prove
that a program respects its formal specification is a very complex
and time-consuming task, due in particular to the lack of feedback
in case of proof failures. Along with a non-compliance between
the code and its specification (due to an error in at least one of
them), possible reasons for a proof failure include a missing or too
weak specification for a called function or a loop, and a lack of time
or simply the incapacity of the prover to finish a particular proof.
This work proposes a new methodology where test generation
helps to identify the reason for a proof failure and to exhibit a
counter-example clearly illustrating the issue. We describe how
to transform an annotated C program into C code suitable for
testing and illustrate the benefits of the method on comprehensive
examples. The method has been implemented in StaDy, a plugin
of the software analysis platform Frama-C. Initial experiments
show that detecting non-compliances and contract weaknesses
makes it possible to precisely diagnose most proof failures.

Keywords: deductive verification, test generation, specification,
proof failure, non-compliance detection, contract weakness
detection, Frama-C

I. Introduction 

Among formal verification techniques, deductive verification
consists in establishing a rigorous mathematical proof that
a given program meets its specification. When no confusion
is possible, one also says that deductive verification consists
in “proving a program”. It requires that the program comes
with a formal specification, usually given in special comments
called annotations, including function contracts (with pre- and
postconditions) and loop contracts (with loop variants and
invariants). The weakest precondition calculus proposed by
Dijkstra [1] reduces any deductive verification problem to
establishing the validity of first-order formulas called verification
conditions.

In modular deductive verification of a function f calling
another function g, the roles of the pre- and postconditions
of f and of the callee g are dual. The precondition of f is
assumed and its postcondition must be proved, while at any
call of g in f, the precondition of g must be proved before
the call and its postcondition is assumed after the call. The
situation for a function f with one call to g is presented in
Fig. 1a. An arrow in this figure informally indicates that its
initial point provides a hypothesis for a proof of its final point.
For instance, the precondition Pre_f of f and the postcondition
Post_g of g provide hypotheses for a proof of the postcondition
Post_f of f. The called function g is proved separately. The
verification of the loop invariant I of a loop in f is illustrated
by Fig. 1b. I must be proved to hold initially before the first
loop iteration, and I ∧ ¬b is assumed after exiting the loop.
In addition, the preservation of the loop invariant I by each
single iteration of the loop must be established during the
proof of f. (Loop termination, not illustrated in Fig. 1b, can
be proved as well.)

    // Pre_f assumed
    f(<args>) {
      code1;
      // Pre_g to be proved
      g(<args>);
      // Post_g assumed
      code2;
    }
    // Post_f to be proved

    (a) called function g

    // Pre_f assumed
    f(<args>) {
      code1;
      // I to be proved
      while (b) {
        // I ∧ b assumed
        code3;
        // I to be proved
      }
      // I ∧ ¬b assumed
      code2;
    }
    // Post_f to be proved

    (b) loop

Fig. 1: Verification of a function f with a callee or a loop

To reflect the fact that some contracts become hypotheses during
deductive verification of f, we use the term subcontracts for f to
designate contracts of called functions and loops in f.

Motivation. One of the most important difficulties in deductive
verification is the manual processing of proof failures by the
verification engineer, since proof failures may have several
causes. Indeed, a failure to prove Pre_g in Fig. 1a may be due to a
non-compliance of the code to the specification: an error in the
code code1, or a wrong specification Pre_f or Pre_g itself that may
incorrectly formalize the requirements. The verification can also
remain inconclusive because the prover is unable to finish a
particular proof within the allocated time. In many cases, it is
extremely difficult for the verification engineer to decide how to
proceed: either suspect a non-compliance and look for an error in
the code or check the specification, or suspect a prover incapacity,
give up automatic proof and try to achieve an interactive proof
with a proof assistant (like Coq [2]).

A failure to prove the postcondition Post_f (cf. Fig. 1a) is
even more complex to analyze: along with a prover incapacity
or a non-compliance due to errors in the pieces of code code1
and code2 or an incorrect specification Pre_f or Post_f, the
failure can also result from a too weak postcondition Post_g
of g that does not fully express the intended behavior of
g. Notice that in this last case, the proof of g can still be
successful. The current automated tools for program proving
do not provide a precise indication of the reason for the proof
failure. The most advanced tools (like Dafny [3]) produce a
counter-example extracted from the underlying solver without
saying directly whether the verification engineer should look for a
non-compliance, or strengthen subcontracts (and which one
of them), or consider adding lemmas or using
interactive proof. So the verification engineer must basically
consider all possible reasons one after another, maybe also
trying a very costly interactive proof. For a loop, the situation
is similar and offers an additional challenge: proving the
invariant preservation, whose failure can be due to several
reasons as well.

The motivation of this work is twofold. First, we want to
provide the verification engineer with more precise feedback
indicating the reason for each proof failure. Second, we look
for a counter-example that either confirms the non-compliance
and demonstrates that the unproven predicate can indeed fail
on a test datum, or confirms a subcontract weakness by showing
on a test datum which subcontract is insufficient.

Approach and goals. We propose to use advanced test generation
techniques in order to diagnose a proof failure and produce
counter-examples. Their usage requires a translation of
the annotated C program into executable C code suitable for
testing. Previous works addressed the generation of
counter-examples only for non-compliance [4] and proposed a
rule-based formalization of annotation translation in that case [5].
The cases of subcontract weakness remained undetected and
indistinguishable from a prover incapacity. The overall goal
of the present work is to provide a methodology for a more
precise identification of proof failure reasons in all these cases,
to implement it and to evaluate it in practice. The proposed
method is composed of two steps. The first step looks for a
non-compliance. If no non-compliance is detected, the second step
looks for a subcontract weakness. Another goal is to make this
method automatic and suitable for a non-expert verification
engineer. Following the modular verification approach, we
assume that the called functions respect their contracts. To
simplify the presentation, we also assume that the loops
preserve their loop invariants, and focus on other proof failures
occurring during modular verification of f. (The proposed
detection techniques can be adapted to the verification of a
loop contract.)

The contributions of this paper include:

• a classification of proof failures into three categories:
non-compliance, subcontract weakness and prover
incapacity,

• a definition of counter-examples for the first two
categories,

• a new program transformation technique for the
diagnosis of a subcontract weakness by testing (in
addition to the one previously proposed for
non-compliance [5]),

• a complete testing-based methodology for diagnosis
of proof failures and generation of counter-examples,
suggesting possible actions for each category,
illustrated on several comprehensive examples,

• an implementation of the proposed solution in a tool
called StaDy, and

• experiments showing its capacity to diagnose proof
failures.

Paper outline. Sections II and III respectively present the
tools used in this work and an illustrative example. Section IV
defines the categories of proof failures and counter-examples,
and presents program transformations for their identification.
The complete methodology for the diagnosis of proof failures
is presented in Section V. Our implementation and experiments
are described in Sec. VI. Finally, Sections VII and VIII present
some related works and a conclusion.

II. Frama-C Toolset 

This work is realized in the context of the Frama-C 
toolset. Frama-C 0 is a platform dedicated to analysis of 
C programs that includes various source code analyzers in 
separate plugins. The Value plugin performs value analysis 
by abstract interpretation. The Wp plugin performs weakest 
precondition calculus for deductive verification of C programs. 
Several automatic SMT solvers can be used to prove the 
verification conditions generated by Wp. In this work we use 
Alt-Ergo 0.99.1 and CVC3 2.4.1. Frama-C also includes 
plugins for control-flow and program dependency graph
construction, program slicing, impact analysis, test generation, etc.

To express properties over C programs, Frama-C offers a 
behavioral specification language named ACSL Q, 0. ACSL 
annotations play a central role in communication between 
plugins: any analyzer can both add annotations to be verified 
by other ones and notify other plugins about its own analysis 
results by changing an annotation status. The status can 
indicate that the annotation is valid, valid under conditions, 
invalid or undetermined, and which analyzer established that 
result. 

For combinations with dynamic analysis, Frama-C also
supports E-ACSL ( 21 , 0 , a rich executable subset of ACSL
suitable for runtime assertion checking. E-ACSL can express
function contracts (pre/postconditions, guarded behaviors,
completeness and disjointness of behaviors), assertions and loop
contracts (variants and invariants). It supports quantifications
over bounded intervals of integers, mathematical integers and
memory-related constructs (e.g. on validity and initialization).
It comes with an instrumentation-based translating plugin,
called E-ACSL2C, that translates annotations into additional
C code in order to evaluate annotations at runtime and report
failures. Important differences between a translation for
runtime assertion checking and a translation for test generation
(e.g. to support unbounded integer arithmetic in E-ACSL and
some specific annotations) [5] make E-ACSL2C inadequate for
our work and create the need for a dedicated translation tool.

For test generation, this work relies on PathCrawler [10],
a Dynamic Symbolic Execution testing tool combining
concrete and symbolic execution. PathCrawler is
based on a specific constraint solver, COLIBRI, that
implements advanced features such as floating-point and modular
integer arithmetic support. PathCrawler provides coverage
strategies like k-path (feasible paths with at most k consecutive
loop iterations) and all-paths (all feasible paths without
any limitation on loop iterations). PathCrawler is sound,
meaning that each test case activates the test objective for
which it was generated. This is verified by concrete execution.
PathCrawler is also complete in the following sense: when
/*@ predicate is_rgf(int *a, ℤ n) =
      a[0] == 0 ∧ ∀ ℤ i; 1 ≤ i < n ⇒ (0 ≤ a[i] ≤ a[i-1]+1); */

/*@ lemma max_rgf: ∀ int* a; ∀ ℤ n;
      is_rgf(a, n) ⇒ (∀ ℤ i; 0 ≤ i < n ⇒ a[i] ≤ i); */

/*@ requires n > 0;
    requires \valid(a+(0..n-1));
    requires 1 ≤ i ≤ n-1;
    requires is_rgf(a,i+1);
    assigns a[i+1..n-1];
    ensures is_rgf(a,n); */
void g(int a[], int n, int i) {
  int k;
  /*@ loop invariant i+1 ≤ k ≤ n;
      loop invariant is_rgf(a,k);
      loop assigns k, a[i+1..n-1];
      loop variant n-k; */
  for (k = i+1; k < n; k++) a[k] = 0;
}

/*@ requires n > 0;
    requires \valid(a+(0..n-1));
    requires is_rgf(a,n);
    assigns a[1..n-1];
    ensures is_rgf(a,n);
    ensures \result == 1 ⇒
      ∃ ℤ j; 0 ≤ j < n ∧
        (\at(a[j],Pre) < a[j] ∧
         ∀ ℤ k; 0 ≤ k < j ⇒ \at(a[k],Pre) == a[k]); */
int f(int a[], int n) {
  int i,k;
  /*@ loop invariant 0 ≤ i ≤ n-1;
      loop assigns i;
      loop variant i; */
  for (i = n-1; i >= 1; i--)
    if (a[i] <= a[i-1]) { break; }
  if (i == 0) { return 0; } // Last RGF.
  //@ assert a[i]+1 ≤ 2147483647;
  a[i] = a[i] + 1;
  g(a, n, i);
  /*@ assert ∀ ℤ l; 0 ≤ l < i ⇒ \at(a[l],Pre) == a[l]; */
  return 1;
}

Fig. 2: Successor function for restricted growth functions
(RGF)

the tool manages to explore all feasible paths of the program, 
all features of the program are supported by the tool and 
constraint solving terminates for all paths, the absence of a 
test for some test objective means that the test objective is 
infeasible, since the tool does not approximate path constraints 
□S Sec. 3.1]. 

III. Illustrating Example 

We illustrate the issues arising in deductive verification
of programs and the solutions we propose on the example
C program of Fig. 2. It comes from ongoing work
on formal specification and deductive verification [11] and
implements an algorithm proposed in [12, page 235]. The
example of Fig. 2 concerns the generation of Restricted Growth
Functions (RGF), defined by the property expressed by the
ACSL predicate is_rgf on lines 1-2 of Fig. 2, where the
RGF a is represented by the C array of its values. For
the convenience of the reader, some ACSL notations are replaced
by mathematical symbols (e.g. keywords \exists, \forall and
integer are respectively denoted by ∃, ∀ and ℤ).

Fig. 2 shows a main function f and an auxiliary function g.
The precondition of f states that a is a valid array of size n>0
(lines 22-23) and must be an RGF (line 24). The postcondition
states that the function is only allowed to modify the values
of array a except the first one a[0] (line 25), and that the
generated array a is still an RGF (line 26). Moreover, if the
function returns 1 then the generated RGF a must respect an
additional property (lines 27-30). Here \at(a[j],Pre) denotes
the value of a[j] in the Pre state, i.e. before the function starts
execution.

We focus now on the body of the function f in Fig. 2.
The loop on lines 36-37 goes through the array from right
to left to find the rightmost non-increasing element, that is,
the maximal array index i such that a[i] ≤ a[i-1]. If such an
index i is found, the function increments a[i] (line 40) and
fills out the rest of the array with 0's (call to g, line 41). The
loop contract (lines 33-35) specifies the interval of values of
the loop variable, the variable that the loop can modify, as well
as a loop variant that can be used to ensure the termination
of the loop. The loop variant expression must be non-negative
whenever an iteration starts, and must strictly decrease after it.

The function g is used to fill the array with zeros to the
right of index i. In addition to size and validity constraints
(lines 7-8), its precondition requires that the elements of a up
to index i form an RGF (lines 9-10). The function is allowed
to modify the elements of a starting from the index i+1 (line 11)
and generates an RGF (line 12). The loop invariants indicate
the value interval of the loop variable k (line 15), and state
that the property is_rgf is satisfied up to k (line 16). This
invariant allows a deductive verification tool to deduce the
postcondition. The annotation loop assigns (line 17) says that
the only values the loop can change are k and the elements of
a starting from the index i+1. The term n-k is a variant of the
loop (line 18).

The ACSL lemma max_rgf on lines 4-5 states that if an 
array is an RGF, then each of its elements is at most equal to 
its index. This lemma is not proved as such by Wp but can be 
used to ensure the absence of overflow at line 40. 
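
To make the example concrete, the following is a minimal executable sketch of Fig. 2 in plain C, with the annotations dropped. The helper max_rgf_holds is a hypothetical name introduced here only to check the lemma on one concrete array; it is not part of the original program, and the runtime check inside f merely mirrors the role the lemma plays for line 40.

```c
#include <assert.h>
#include <stdbool.h>

/* Executable counterpart of the predicate is_rgf:
   a[0] == 0 and 0 <= a[i] <= a[i-1]+1 for every 1 <= i < n. */
static bool is_rgf(const int *a, int n) {
    if (n <= 0 || a[0] != 0) return false;
    for (int i = 1; i < n; i++)
        if (a[i] < 0 || a[i] > a[i - 1] + 1) return false;
    return true;
}

/* Hypothetical helper: checks the property stated by lemma max_rgf
   (each element is at most its index) on one concrete array. */
static bool max_rgf_holds(const int *a, int n) {
    for (int i = 0; i < n; i++)
        if (a[i] > i) return false;
    return true;
}

/* g: fills the array with zeros to the right of index i. */
static void g(int a[], int n, int i) {
    for (int k = i + 1; k < n; k++) a[k] = 0;
}

/* f: computes the successor RGF in place; returns 0 on the last RGF. */
static int f(int a[], int n) {
    int i;
    for (i = n - 1; i >= 1; i--)
        if (a[i] <= a[i - 1]) break;
    if (i == 0) return 0;           /* last RGF: nothing to increment */
    assert(a[i] <= i);              /* consequence of max_rgf: no overflow below */
    a[i] = a[i] + 1;
    g(a, n, i);
    return 1;
}
```

For instance, starting from the RGF 0,0,1 the function f yields 0,1,0 and returns 1, while on 0,1,2 (the last RGF of length 3) it returns 0 and leaves the array unchanged.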

The functions of Fig. 2 can be fully proved using Wp.
Suppose now this example contains one of the following four
mistakes: the verification engineer either forgets the
precondition on line 24, or writes the wrong assignment
a[i]=a[i]+2; on line 40, or puts a too general clause
loop assigns i,a[1..n-1]; on line 34, or forgets to provide the
lemma on lines 4-5. In each of these four cases, the proof fails
(for the precondition of g on line 41 and/or the assertion on
line 39) for different reasons. In fact, only in the first two cases
are the code and specification not compliant, while the third
failure is due to a too weak subcontract, and the last one comes
from a prover incapacity. To the best of our knowledge, none of
the existing techniques makes it possible to automatically
distinguish the three reasons and suggest suitable actions. This
work proposes a complete methodology to provide such assistance.

IV. Categories of Proof Failures and 
Counter-Examples 

Let P be a C program annotated in E-ACSL, and f the
function under verification in P. Function f is assumed to be
recursion-free. It may call other functions; let g denote any of
them. A test datum V for f is a vector of values for all input
variables of f. The program path activated by a test datum V,
denoted π_V, is the sequence of program statements executed by
the program on the test datum V. We use the general term
contract to designate the set of E-ACSL annotations describing
a loop or a function. A function contract is composed of
pre- and postconditions including E-ACSL clauses requires,
assigns and ensures (cf. lines 22-30 in Fig. 2). A loop contract
is composed of loop invariant, loop variant and loop assigns
clauses (cf. lines 15-18 in Fig. 2).

Obviously, an annotation cannot be proved for all inputs if
there exist inputs for which the property does not hold. The
notion of counter-example depends on the way annotations
are evaluated. The diagnosis of proof failures based on the
prover’s counter-examples can be imprecise since, from the
prover’s point of view, the code of callees and loops in f
is replaced by the corresponding subcontracts. To make this
diagnosis more precise, we propose to take into account their
code as well as their contracts, and to treat both by testing. In
this section, we define three kinds of proof failure reasons, two
kinds of counter-examples and associated detection techniques.
Sec. IV-A defines a non-compliance and briefly recalls the
detection technique previously published in [5]. Sec. IV-B is
part of the original contribution of this paper: it introduces
two new categories of proof failures and a new translation for
test generation.

A. Non-Compliance 

A previous work [5] formally described how to transform
a C program P annotated in E-ACSL into an instrumented
program, denoted P^NC in this paper, on which we can apply
test generation to produce test data violating some annotations
at runtime.¹ P^NC checks all annotations of P in the
corresponding program locations and reports any failure. For
instance, the postcondition Post_f of f is evaluated by the
following code inserted at the end of the function f in P^NC:

    int post_f; Spec2Code(Post_f, post_f); fassert(post_f);   (†)

For an E-ACSL predicate p, we denote by Spec2Code(p, b) the
generated C code evaluating the predicate p and assigning its
validity status to the Boolean variable b (see [5] for details).
The function call fassert(b) is expanded into a conditional
statement if(!b) that reports the failure and exits whenever
b is false. Similarly, preconditions and postconditions of a
callee g are evaluated respectively before and after executing
the function g. A loop invariant is checked before the loop
(for being initially true) and after each loop iteration (for
being preserved by the previous loop iteration). An assertion is
checked at its location. To generate only test data that respect
the precondition Pre_f of f, it is checked at the beginning of
f similarly to (†), except that fassert is replaced by fassume to
assume the given condition.
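
As an illustration only, the following sketch hand-instruments the function f of Fig. 2 in the spirit of P^NC, for three annotations: the precondition of f (assumed), the precondition is_rgf(a,i+1) of g (asserted before the call) and the postcondition is_rgf(a,n) of f (asserted at the end). It is not the output of the actual translation: the verdict type, the name f_nc and the parameter step are assumptions made for this example, and failures are returned as a verdict instead of exiting as fassert and fassume do.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical verdicts replacing the "report and exit" behavior. */
typedef enum { PASS, PRE_NOT_MET, ANNOT_FAILED } verdict;

/* Code playing the role of Spec2Code(is_rgf(a,n), b). */
static bool is_rgf(const int *a, int n) {
    if (n <= 0 || a[0] != 0) return false;
    for (int i = 1; i < n; i++)
        if (a[i] < 0 || a[i] > a[i - 1] + 1) return false;
    return true;
}

/* Hand-written stand-in for f in P^NC. step == 1 is the code of
   Fig. 2; step == 2 seeds the wrong assignment a[i] = a[i] + 2. */
static verdict f_nc(int a[], int n, int step) {
    bool pre_f = n > 0 && is_rgf(a, n);
    if (!pre_f) return PRE_NOT_MET;          /* fassume(pre_f) */

    int i;
    for (i = n - 1; i >= 1; i--)
        if (a[i] <= a[i - 1]) break;
    if (i == 0) return PASS;                 /* last RGF, array unchanged */

    a[i] = a[i] + step;

    bool pre_g = is_rgf(a, i + 1);
    if (!pre_g) return ANNOT_FAILED;         /* fassert(pre_g) */
    for (int k = i + 1; k < n; k++) a[k] = 0; /* body of g */

    bool post_f = is_rgf(a, n);
    if (!post_f) return ANNOT_FAILED;        /* fassert(post_f) */
    return PASS;
}
```

With the seeded error, the test datum a = {0,0}, n = 2 respects the precondition of f and makes the check of the precondition of g fail, so it plays the role of a counter-example for that annotation. The same datum passes on the unmodified code.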

Definition 1 (Non-compliance): We say that there is a
non-compliance between code and specification in P if there exists
a test datum V for f respecting its precondition, such that P^NC
reports an annotation failure on V. In this case, we say that V
is a non-compliance counter-example (NCCE).

Test generation on the translated program P^NC can be
used to generate NCCEs (cf. [4]). We call this technique
Non-Compliance Detection (NCD). In this work we use the
PathCrawler test generator that will try to cover all program
paths. Since the translation step has added a branch for the
false value of each annotation, PathCrawler will try to
cover at least one path where the annotation does not hold.
(An optimization in PathCrawler avoids covering the same
fassert failure several times.) The NCD step may have three
outcomes. First, it returns (nc, V, a) if an NCCE V has been
found, indicating the failing annotation a and recording the
program path π_V activated by V on P^NC. Second, if it has
managed to perform a complete exploration of all program paths
without finding an NCCE, it returns no (cf. the discussion of
completeness at the end of Sec. II). Otherwise, if only a partial
exploration of program paths has been performed (due to a
timeout, partial coverage criterion or any other limitation), it
returns ? (unknown).

¹ This translation is illustrated by Fig. 10 in Appendix A. For
simplicity, we present it for all annotations at the same time, as
in [5]. Its adaptation for a modular approach, or even to a
particular annotation whose proof fails, is straightforward.

    /*@ assigns k1,...,kN;
      @ ensures P; */
    Type_g g(...){ code1; }

    Type_f f(...){ code2;
      g(Args_g);
      code3; }

    (a)

     1 Type_g g_swd(...){
     2   k1=Nondet(); ... kN=Nondet();
     3   Type_g ret = Nondet();
     4   int post; Spec2Code(P, post);
     5   fassume(post); return ret;
     6 } //respects contract of g
     7 Type_g g(...){ code1; }
     8 Type_f f(...){ code2;
     9   g_swd(Args_g);
    10   code3; }

    (b)

Fig. 3: (a) A contract c ∈ C of callee g in f, vs. (b) its
translation for SWD

B. Subcontract Weakness and Prover Incapacity 

To introduce the new categories of proof failures, we follow
the modular verification approach and need a few definitions.
A non-imbricated loop (resp. function, assertion) in f is a loop
(resp. function called, assertion) in f outside any loop in f. A
subcontract for f is the contract of some non-imbricated loop
or function in f. A non-imbricated annotation in f is either a
non-imbricated assertion or an annotation in a subcontract for
f. For instance, the function f of Fig. 2 has two subcontracts:
the contract of the called function g and the contract of the
loop on lines 33-37. The contract of the loop in g on lines
15-19 is not a subcontract for f, but is a subcontract for g.

We focus on non-imbricated annotations in f and assume
that all subcontracts for f are respected: the called functions in
f respect their contracts, and the loops in f preserve their loop
invariants and respect all imbricated annotations. Let c_f denote
the contract of f, C the set of non-imbricated subcontracts for
f, and A the set of all non-imbricated annotations in f and the
annotations of c_f. In other words, A contains the annotations
included in the contracts C ∪ {c_f} as well as non-imbricated
assertions in f. We also assume that any subcontract of
f contains a (loop) assigns clause. This assumption is not
restrictive since such a clause is anyway necessary for the
proof of any nontrivial code.

Subcontract weakness. To apply testing for the contracts
of called functions and loops in C instead of their code, we
use a program transformation of P producing a new program
P^GSW. The code of all non-imbricated function calls and loops
in f is replaced by new code as follows.

For the contract c ∈ C of a called function g in f, the
program transformation (illustrated by Fig. 3) generates a new
function g_swd with the same signature whose code simulates
any possible behavior respecting the postcondition in c, and
replaces all calls to g by a call to g_swd. First, g_swd allows
any of the variables (or, more generally, left-values) present in
the assigns clause of c to change its value (line 2 in Fig. 3b).
This can be realized by assigning a non-deterministic value of the
appropriate type using a dedicated function, denoted here by
Nondet() (or simply by adding an array of fresh input variables
and reading a different value for each use and each function
invocation). If the return type of g is not void, another
non-deterministic value is read for the returned value ret (line 3 in
Fig. 3b). Finally, the validity of the postcondition is evaluated
(taking into account these new non-deterministic values) and
assumed in order to consider only executions that respect the
postcondition, and the function returns (lines 4-5 in Fig. 3b).

    Type_f f(...){ code1;
      /*@ loop assigns x1,...,xN;
        @ loop invariant I; */
      while(b) { code2; }
      code3; }

Fig. 4 (a): A contract c ∈ C of a loop in f

    int x;
    /*@ ensures x ≥ \old(x)+1; assigns x; */
    void g1() { x=x+2; }
    /*@ ensures x ≥ \old(x)+1; assigns x; */
    void g2() { x=x+2; }
    /*@ ensures x ≥ \old(x)+1; assigns x; */
    void g3() { x=x+2; }
    /*@ ensures x ≥ \old(x)+4; assigns x; */
    void f() { g1(); g2(); g3(); }

(a) Absence of SWCEs for any single subcontract does not imply
absence of global SWCEs

    int x;
    /*@ ensures x ≥ \old(x)+1; assigns x; */
    void g1() { x=x+1; }
    /*@ ensures x ≥ \old(x)+1; assigns x; */
    void g2() { x=x+1; }
    /*@ ensures x ≥ \old(x)+1; assigns x; */
    void g3() { x=x+2; }
    /*@ ensures x ≥ \old(x)+4; assigns x; */
    void f() { g1(); g2(); g3(); }

(b) Global SWCEs do not help to find precisely a too weak subcontract

Fig. 5: Two examples with several subcontracts

Similarly, for the contract c ∈ C of a loop in f, the program
transformation replaces the code of the loop by another code
that simulates any possible behavior respecting c, that is,
ensuring the “loop postcondition” I ∧ ¬b after the loop, as
shown in Fig. 4. In addition, the transformation treats in the
same way as in P^NC all other annotations in A: preconditions
of called functions, initial loop invariant verifications and the
pre- and postcondition of f (they are not shown in Fig. 3b
and 4b).
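
The loop translation can be sketched on a small hypothetical function, introduced here only for illustration: count(n) increments i from 0 while i < n and is expected to return n, with the loop contract "loop assigns i" and either the weak invariant 0 ≤ i or the stronger invariant 0 ≤ i ≤ n. In the sketch, the fresh input nondet_i stands for the value produced by Nondet(), and a verdict is returned instead of exiting; none of these names come from the original tool.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { PASS, ASSUME_FAILED, ANNOT_FAILED } verdict;

/* Original code: requires n >= 0, ensures \result == n. */
static int count(int n) {
    int i = 0;
    while (i < n) i++;
    return i;
}

/* Hand-written translation of count for subcontract weakness
   detection: the loop is replaced by its contract.
   strong_inv selects the invariant 0 <= i <= n instead of 0 <= i. */
static verdict count_swd(int n, int nondet_i, bool strong_inv) {
    if (!(n >= 0)) return ASSUME_FAILED;         /* fassume(Pre_f) */

    int i = 0;
    bool inv_init = 0 <= i && (!strong_inv || i <= n);
    if (!inv_init) return ANNOT_FAILED;          /* invariant holds initially */

    i = nondet_i;                                /* loop assigns i */
    bool inv = 0 <= i && (!strong_inv || i <= n);
    if (!(inv && !(i < n))) return ASSUME_FAILED; /* fassume(I && !b) */

    if (!(i == n)) return ANNOT_FAILED;          /* fassert(Post_f) */
    return PASS;
}
```

With the weak invariant, the inputs n = 0 and nondet_i = 1 respect I ∧ ¬b but violate the postcondition, although count(0) itself returns 0: the loop contract is too weak. With the stronger invariant, the same inputs are rejected by the assumption, and the only value left after the loop is i = n.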

Definition 2 (Global subcontract weakness): We say that
P has a global subcontract weakness for f if there exists a test
datum V for f respecting its precondition, such that P^NC does
not report any annotation failure on V, while P^GSW reports
an annotation failure on V. In this case, we say that V is a
global subcontract weakness counter-example (GSWCE) for
the set of subcontracts C.

Notice that we do not consider the same counter-example
as an NCCE and an SWCE. Indeed, even if some
counter-examples may illustrate both a subcontract weakness and a
non-compliance, we consider that non-compliances usually
come from a direct conflict between the code and the
specification and should be addressed first, while contract weaknesses
are often more subtle and will be easier to address when
non-compliances are eliminated.

    Type_f f(...){ code1;
      x1=Nondet(); ... xN=Nondet();
      int inv1; Spec2Code(I, inv1);
      fassume(inv1 && !b); //respects loop contract
      code3; }

Fig. 4 (b): Translation of the loop contract for SWD

Again, test generation can be applied on P^GSW to generate
GSWCE candidates. When it finds a test datum V such that
P^GSW fails on V, we use runtime assertion checking: if P^NC
fails on V, then V is classified as an NCCE, otherwise V is a
GSWCE. We call this technique Global Subcontract Weakness
Detection for the set of all subcontracts, denoted GSWD. The
GSWD step may have four outcomes. It returns (nc, V, a) if
an NCCE V has been found for the failing annotation a, and
(sw, V, a, C) if V has been finally classified as an SWCE,
indicating the failing annotation a and the set of subcontracts
C. The program path π_V activated by V and leading to the
failure (on P^NC or P^GSW) is recorded as well. If GSWD
has managed to perform a complete exploration of all program
paths without finding a GSWCE, it returns no. Otherwise, if
only a partial exploration of program paths has been performed,
it returns ? (unknown).
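
The four outcomes of the GSWD step can be summarized as a small decision function. This is only a reading aid under stated assumptions: the three Boolean inputs are hypothetical summaries of what test generation on P^GSW and the replay on P^NC have established, and the enum and function names are not taken from StaDy.

```c
#include <assert.h>
#include <stdbool.h>

/* The four possible outcomes of GSWD: nc, sw, no and ? (unknown). */
typedef enum { OUT_NC, OUT_SW, OUT_NO, OUT_UNKNOWN } outcome;

/* candidate_found:      some test datum V makes P^GSW fail;
   nc_fails_on_v:        replaying V on P^NC also fails (only
                         meaningful when candidate_found is true);
   exploration_complete: all program paths of P^GSW were explored. */
static outcome gswd_outcome(bool candidate_found,
                            bool nc_fails_on_v,
                            bool exploration_complete) {
    if (candidate_found)
        return nc_fails_on_v ? OUT_NC : OUT_SW;
    return exploration_complete ? OUT_NO : OUT_UNKNOWN;
}
```

The replay on P^NC is what gives priority to non-compliances: a candidate is reported as a subcontract weakness only when the real code of the callees and loops does not fail on it.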

A GSWCE indicates a global subcontract weakness but
does not explicitly identify which single subcontract c ∈ C
is too weak. To do that, we propose another program
transformation of P into an instrumented program P_c^SSW. It is
realized by replacing only one non-imbricated function call or
loop by the code respecting the postcondition of the corresponding
subcontract c (as indicated in Fig. 3 and 4) and transforming
other annotations in A as in P^NC.

Definition 3 (Single subcontract weakness): Let c be a
subcontract for f. We say that c is a too weak subcontract (or
has a single subcontract weakness) for f if there exists a test
datum V for f respecting its precondition, such that P^NC does
not report any annotation failure on V, while P_c^SSW reports
an annotation failure on V. In this case, we say that V is
a single subcontract weakness counter-example (SSWCE) for
the subcontract c in f.

For any subcontract c ∈ C, test generation can be separately
applied on P_c^SSW to generate SSWCE candidates. If such a
test datum V is generated, it is checked on P^NC to classify it
as an NCCE or an SSWCE. We call this technique, applied
for all subcontracts one after another until a first
counter-example V is found, Single Subcontract Weakness Detection,
denoted SSWD. The SSWD step may have three outcomes.
It returns (nc, V, a) if an NCCE V has been found for a
failing annotation a, and (sw, V, a, {c}) if V has been finally
classified as an SSWCE, indicating the failing annotation a
and the single too weak subcontract c. The program path π_V
activated by V and leading to the failure (on P^NC or P_c^SSW)
is recorded as well. Otherwise, it returns ? (unknown), since
even after an exhaustive path testing the absence of SSWCE
for any individual subcontract c does not imply the absence of
a GSWCE.

Fig. 6: Combined verification methodology in case of a proof failure on P

Indeed, sometimes SSWD cannot exhibit a subcontract
weakness for a single subcontract while there is a global
subcontract weakness for all of them at once. For example,
in Fig. 5a, if we apply SSWD to any of the subcontracts, we
always have x ≥ \old(x)+5 at the end of f (we add at least 1 to x
by executing the translated subcontract, and add 2 twice by
executing the other two functions’ code), so the postcondition
of f holds and no weakness is detected. If we run GSWD to
consider all subcontracts at once, we only get x ≥ \old(x)+3 after
executing the three subcontracts, and can exhibit a
counter-example.

On the other hand, running GSWD produces a GSWCE
that does not indicate which one of the subcontracts is too
weak, while SSWD can sometimes be more precise. For
Fig. 5b, since the three callees are replaced by their
subcontracts for GSWD, it is impossible to find out which
one is too weak. Counter-examples generated by a prover
suffer from the same precision issue: taking into account all
subcontracts instead of the corresponding code prevents
a precise identification of a single too weak subcontract. In
this example we can be more precise with SSWD, since only
the replacement of the subcontract of g3 leads to an
SSWCE: we can have x ≥ \old(x)+3 by executing g1, g2 and the
subcontract of g3, exhibiting the contract weakness of g3. Thus,
the proposed SSWD technique can provide the verification
engineer with a more precise diagnostic than counter-examples
extracted from a prover.

We define a combined subcontract weakness detection
technique, denoted SWD, applying first SSWD followed by
GSWD until the first SWCE is found. SWD may have the
same four outcomes as GSWD. It allows us to be both precise
(indicating, when possible, a single subcontract that is too
weak) and complete (capable of finding GSWCEs even when
there are no single subcontract weaknesses).

Prover incapacity. When neither a non-compliance nor a
global subcontract weakness exists, we cannot demonstrate that
it is impossible to prove the property.

Definition 4 (Prover incapacity): We say that a proof failure
in P is due to a prover incapacity if for any test datum
V for f respecting its precondition, neither P^NC nor P^GSW
reports any annotation failure on V. In other words, there is no
NCCE and no GSWCE for P.

V. Diagnosis of Proof Failures using Structural Testing

In this section, we present an overview of our method for
diagnosis of proof failures using the detection techniques of
Sec. IV and illustrate it on several examples. We also provide a
comprehensive list of suggestions of actions for each category
of proof failures.


The method. The proposed method is illustrated by Fig. 6.
Suppose that the proof of the annotated program P fails for
some non-imbricated annotation a ∈ A. The first step tries to
find a non-compliance using NCD. If such a non-compliance
is found, it generates an NCCE (marked by (1) in Fig. 6) and
classifies the proof failure as a non-compliance. If the first step
cannot generate a counter-example, the SWD step combines
SSWD and GSWD and tries to generate single SWCEs, then
global SWCEs, until the first counter-example is generated
and classified (either as an NCCE (1) or an SWCE (2)).
If no counter-example has been found, the last step checks
the outcomes. If both NCD and SWD have returned no,
that is, both NCD and GSWD have performed a complete
path exploration without finding a counter-example, the proof
failure is classified as a prover incapacity (3) (cf. Def. 4).
Otherwise, it remains unclassified (4). Fig. 7 associates a
variant of the illustrating example to each case. For each case,
we detail the lines we modified in the program of Fig. 2 to
obtain a new program, the intermediate results of deductive
verification, NCD and SWD and the final verdict (including
the generated counter-example if any).
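The decision logic of this method can be summarized by the following C sketch. It is a simplified model, not StaDy's code: the type and function names are hypothetical, and the outcomes of the three test generation steps are passed in as already computed values, whereas the method described above runs SSWD and GSWD only when the previous step has not produced a counter-example.

```c
#include <assert.h>

/* Outcome of one detection step: non-compliance found, subcontract
   weakness found, complete exploration without counter-example, unknown. */
typedef enum { OUT_NC, OUT_SW, OUT_NO, OUT_UNKNOWN } outcome_t;

/* Final verdicts, numbered (1) to (4) in the text. */
typedef enum {
    NON_COMPLIANCE,        /* (1) */
    SUBCONTRACT_WEAKNESS,  /* (2) */
    PROVER_INCAPACITY,     /* (3) */
    UNCLASSIFIED           /* (4) */
} verdict_t;

/* SWD: SSWD first, then GSWD if SSWD found no counter-example. */
outcome_t swd(outcome_t sswd, outcome_t gswd) {
    assert(sswd != OUT_NO);   /* SSWD alone can never conclude "no" */
    if (sswd == OUT_NC || sswd == OUT_SW)
        return sswd;
    return gswd;
}

verdict_t diagnose(outcome_t ncd, outcome_t sswd, outcome_t gswd) {
    assert(ncd != OUT_SW);    /* NCD only looks for non-compliances */
    if (ncd == OUT_NC)
        return NON_COMPLIANCE;
    outcome_t s = swd(sswd, gswd);
    if (s == OUT_NC)
        return NON_COMPLIANCE;
    if (s == OUT_SW)
        return SUBCONTRACT_WEAKNESS;
    if (ncd == OUT_NO && s == OUT_NO)
        return PROVER_INCAPACITY;
    return UNCLASSIFIED;
}
```

A prover incapacity is reported only when both NCD and GSWD have explored all paths; a single timeout on either side leaves the proof failure unclassified.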

The proof failure category and the counter-example V,
along with the recorded path π_V, the reported failing
annotation a and the set of too weak subcontracts S, can be extremely
helpful for the verification engineer. Suppose we try to prove
in Wp a modified version of the function f of Fig. 2 where
the precondition at line 24 is missing. The proof of the
precondition of g on line 10 for the call on line 41 fails without
indicating a precise reason. The NCD step of StaDy generates
an NCCE (case (1), #1 in Fig. 7) where is_rgf(a,n) is clearly
false due to a[0] being non-zero, and indicates the failing
annotation (coming from line 10). That helps the verification
engineer to understand and fix the issue.

Let us suppose now that the clause on line 34 has been
erroneously written as follows: loop assigns i, a[1..n-1];. The
loop on lines 36-37 still preserves its invariant. The NCD step
does not find any NCCE, as this modification did not introduce 
any non-compliance between the code and its specification. 
Thanks to the replacement shown in Fig. El SSWD for the 
contract of this loop will detect a single subcontract weakness
for the loop contract (case (2), #2 in Fig. 7), and report a failure
to establish the precondition of g (on line 10) for the call on
line 41. With the indication of the single subcontract weakness
for the loop, the verification engineer will try to strengthen the
loop contract and find the issue.
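Both counter-examples discussed above can be replayed by hand with an executable version of is_rgf. The sketch below is an illustration under assumptions: the text only states that is_rgf requires a[0] to be zero, so the remaining condition (0 ≤ a[i] ≤ a[i-1]+1, the usual restricted-growth property) is assumed, and loop_as_subcontract is a hypothetical, simplified model of a loop replaced by its subcontract that only overwrites the locations listed in the loop assigns clause. The input values are those reported in Fig. 7.

```c
#include <assert.h>
#include <stdbool.h>

/* Executable check for is_rgf(a, n). Only "a[0] == 0" is stated in the
   text; the bound 0 <= a[i] <= a[i-1]+1 is an assumption here.
   The universal quantifier becomes a loop. */
bool is_rgf(const int *a, int n) {
    assert(n >= 1);
    if (a[0] != 0)
        return false;
    for (int i = 1; i < n; i++)
        if (a[i] < 0 || a[i] - 1 > a[i - 1])   /* written to avoid overflow */
            return false;
    return true;
}

/* Hypothetical model of the loop replaced by its subcontract: every
   location listed in "loop assigns i, a[1..n-1];" receives an arbitrary
   value supplied by the test generator (nondet_a[k] for a[k], nondet_i
   for i). */
void loop_as_subcontract(int *a, int n, int *i,
                         const int *nondet_a, int nondet_i) {
    for (int k = 1; k < n; k++)
        a[k] = nondet_a[k];
    *i = nondet_i;
}
```

For the first variant, the input n = 1, a[0] = -214739 already falsifies is_rgf. For the second one, the input n = 2, a = {0, 0} satisfies is_rgf, but after the modelled loop subcontract sets a[1] to 97157 the property no longer holds, which is what the overly permissive loop assigns clause allows. A faithful translation would additionally assume the loop invariant and the negated loop condition after the loop; this is omitted here.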

Suppose now we want to prove the absence of overflow
at line 40 of Fig. 2 but the lemma on lines 4-5 (that allows
the prover to deduce this property) is missing. The proof fails
without giving a precise reason since the prover does not
perform the induction needed to deduce the right bounds on
a[i]. Neither NCD nor SWD can produce a counter-example,
and as the initial program has too many paths, their outcomes
are ? (unknown) (case (4), #4 in Fig. 7). For such situations,
StaDy offers the possibility to reduce the input domain. The
verification engineer can add the ACSL clause typically n<5;
to reduce the array size for testing (this clause is ignored by the
proof). Running StaDy now allows the tool to complete the
exploration of all program paths (for n<5) both for NCD and
SWD without finding a counter-example. StaDy classifies
the proof failure for the program with the reduced domain
as a prover incapacity (case (3), #3 in Fig. 7). That gives the
verification engineer more confidence that the proof failure has
the same reason on the initial program for bigger sizes n.

  | Impacted lines                             | Intermediate outcome                           | Final outcome
# | Line    | Changes                          | Proof (failing annot.)  | NCD | SWD            |
0 | -       | -                                | ✓                       | -   | -              | Proved
1 | 24      | (deleted)                        | ? (l.39, 41, 26)        | nc  | -              | V = (n=1; a[0]=-214739) is NCCE
2 | 34      | loop assigns i,a[1..n-1];        | ? (l.39, 41, 42, 26-30) | ?   | sw for l.33-34 | V = (n=2; a[0]=0; a[1]=0; nondet_a[1]=97157; nondet_i=0) is SWCE
3 | 4-5; 22 | (deleted); requires n>0 && n<21; | ? (l.39)                | no  | no             | Prover incapacity
4 | 4-5     | (deleted)                        | ? (l.39)                | ?   | ?              | Unknown

Fig. 7: Method results for different versions of the illustrating example.

The verification engineer can then turn to interactive proof or
to adding lemmas or assertions, rather than waste
time looking for a bug or a too weak subcontract.

Suggestions of actions. From the possible outcomes of
the method illustrated in Fig. 6, we are able to suggest to
the verification engineer the most suitable actions (displayed
in Fig. 8) to help her with the verification task. A
non-compliance of the code w.r.t. annotation a means that there
is an inconsistency between the precondition, the annotation
a and the code of the path π_V leading to a. Thanks to the
counter-example, the values of variables at different program
points along π_V can be either traced or explored in a debugger
[13]. In Frama-C, the execution on V can be conveniently
explored using Value or PathCrawler. This helps the
verification engineer to understand the issue. Indeed, if an
NCCE is generated, there is no need to try automatic proof or
look for a too weak subcontract: it will not help. The reason
of the proof failure is necessarily related to a non-compliance
between the code and annotations traversed by the path π_V.

A weakness of a set of subcontracts S means that at least
one of the contracts of C has to be strengthened. By Definitions
2 and 3, the non-compliance is excluded here, that is, the
execution of P^NC on V respects the annotation a, thus the
suggested action is to strengthen the subcontract(s). In the
case of single subcontract weakness, S is a singleton so the
suggestion is very precise and helpful to the user. Again, trying
interactive proof or additional assertions or lemmas will be
useless here since the counter-example shows that the property
cannot be proved. For a prover incapacity, the
verification engineer may write lemmas or assertions, add
hypotheses that may help the theorem prover to succeed or
try another theorem prover. She also may want to use a
proof assistant like Coq, so that she does not suffer from
the limitations of the theorem provers, but this task can be
more complex and time-consuming. Finally, when the verdict
is unknown, test generation for NCD and/or SWD times out,
so the verification engineer may strengthen the precondition
for testing to reduce the input domain, or extend the timeout
to give StaDy more time to conclude.

VI. Implementation and Experiments 

Implementation. The proposed method for diagnosis of
proof failures has been implemented as a Frama-C plugin,
named StaDy. It relies on other plugins: Wp for
deductive verification and PathCrawler [10] for structural
test generation. StaDy currently supports a significant subset
of the E-ACSL specification language, including requires,
ensures, behavior, assumes, loop invariant, loop variant and
assert clauses. Quantified predicates \exists and \forall and
builtin terms such as \sum or \numof are translated as loops. Logic
functions and named predicates are treated by inlining. The
\old constructs are treated by saving the initial values of
formal parameters and global variables at the beginning of the
function. Validity checks of pointers are partially supported
due to the current limitation of the underlying test generator:
we can only check the validity of input pointers and global
arrays. The assigns clauses are only taken into consideration
during the SWD phase: we do not aim to find what is missing
in the assigns clause (NCD) because provers usually give
sufficiently good feedback about it, but we want to find what
is unnecessary and could be removed from an assigns clause
(SWD). Inductive predicates, recursive functions and
floating-point numbers are currently not supported and are part of our
future work.
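As a rough illustration of two of the translation rules above (quantifiers as loops, \old by saving initial values at function entry), the sketch below instruments a small hypothetical function. Neither the function raise_to_offset nor the reporting scheme (a boolean flag annot_ok) comes from StaDy, whose generated code is not shown in this section; they are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

int offset = 0;          /* global variable read by the contract */
bool annot_ok = true;    /* hypothetical flag: false once a check fails */

/*@ requires n >= 0;
  @ ensures offset == \old(offset) + 1;
  @ ensures \forall integer i; 0 <= i < n ==> a[i] >= offset;
  @*/
void raise_to_offset(int *a, int n) {
    assert(n >= 0);                   /* precondition, assumed on entry */
    const int old_offset = offset;    /* \old(offset): saved at entry */

    offset = offset + 1;              /* original function body */
    for (int i = 0; i < n; i++)
        if (a[i] < offset)
            a[i] = offset;

    /* ensures offset == \old(offset) + 1 */
    if (offset != old_offset + 1)
        annot_ok = false;

    /* \forall integer i; 0 <= i < n ==> a[i] >= offset, as a loop */
    for (int i = 0; i < n; i++)
        if (!(a[i] >= offset))
            annot_ok = false;
}
```

Because each annotation becomes ordinary C code on the executed path, a test generator exploring the paths of the instrumented function can reach the branch that clears annot_ok exactly when some input violates the annotation.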

The research questions we address in our experiments are 
the following. 

RQ1 Is StaDy able to precisely diagnose most proof 
failures in C programs? 

RQ2 What are the benefits of the SWD extension (in 
particular, with respect to NCD)? 

RQ3 Is StaDy able to generate NCCEs or SWCEs even 
with a partial testing coverage? 

RQ4 Is StaDy’s execution time comparable to the time of 
an automatic proof? 

Case | Verdict | Suggestions
(1) | Non-compliance w.r.t. the annotation a: (nc, V, a) | check the violated annotation a or the code leading to a in the path π_V, or strengthen the precondition of the function under verification
(2) | Weakness of subcontracts in S w.r.t. the annotation a: (sw, V, a, S) | strengthen one or several subcontracts in S to exclude the subcontract weakness
(3) | Prover incapacity | add lemmas or assertions to help the theorem prover, or use another prover, or an interactive proof assistant
(4) | Unknown | strengthen the typically clause or coverage criterion (e.g. k-path), or increase the timeout limit for testing

Fig. 8: Suggestions of actions for different categories of proof failures

Experimental protocol. The evaluation used 20 annotated
programs from [14], whose size varies from 35 to 100 lines of
annotated C code. These programs manipulate arrays, they are
fully specified in ACSL and their specification expresses
non-trivial properties of C arrays. To evaluate the method presented
in Sec. V and its implementation, we apply StaDy on
systematically generated altered versions (or mutants) of correct
C programs. Each mutant program is obtained by performing
a single modification (or mutation) on the initial program. The
mutations include: a binary operator modification in the code
or in the specification, a condition negation in the code, a
relation modification in the specification, a predicate negation
in the specification, a partial loop invariant or postcondition
deletion in the specification. In this study, we do not mutate
the precondition of the function under verification, and restrict
possible mutations on binary operators to avoid creating absurd
expressions, in particular for pointer arithmetic.
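To make the mutation operators concrete, the sketch below applies two of them (a binary operator modification and a condition negation, both in the code) to a small function. The function max_of and its contract are hypothetical and are not taken from the benchmark; the loop annotations that a proof with Wp would need are omitted for brevity.

```c
#include <assert.h>

/*@ requires n > 0;
  @ ensures \forall integer i; 0 <= i < n ==> \result >= a[i];
  @*/
int max_of(const int *a, int n) {
    assert(n > 0);
    int m = a[0];
    for (int i = 1; i < n; i++)
        if (a[i] > m)
            m = a[i];
    return m;
}

/* Mutant 1: binary operator modification in the code ('>' becomes '<'). */
int max_of_mutant_op(const int *a, int n) {
    assert(n > 0);
    int m = a[0];
    for (int i = 1; i < n; i++)
        if (a[i] < m)
            m = a[i];
    return m;
}

/* Mutant 2: condition negation in the code. */
int max_of_mutant_neg(const int *a, int n) {
    assert(n > 0);
    int m = a[0];
    for (int i = 1; i < n; i++)
        if (!(a[i] > m))
            m = a[i];
    return m;
}
```

On the input a = {1, 3, 2}, both mutants return 1 although a[1] = 3, so this input violates the postcondition and would play the role of an NCCE for each of them.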

The first step tries to prove each mutant using Wp. The
proved mutants respect the specification and are classified as
correct. Second, we apply the NCD method on the remaining
mutants. It classifies proof failures for some mutants as
non-compliances, and indicates the failing annotation and an NCCE.
The third step applies the SWD method on the remaining mutants,
classifies some of them as subcontract weaknesses, and indicates
the weak subcontract and an SWCE. If no counter-example
has been found by SWD, the mutant remains unclassified.
The results are displayed in Fig. 9. The columns present the
number of generated mutants, and the results of each of the
three steps: the number (#) and ratio (%) of classified mutants,
maximal and average execution time (put on two lines) of 
the step over classified mutants (r or t*) and over non- 
classified mutants (/,') at this step. The ratios are computed 
with respect to unclassified mutants after the previous step. The 
NCD + SWD columns sum up selected results after both NCD 
and SWD steps: the average and maximal time ( t ) are shown 
globally over all mutants. The time is computed until the proof 
is finished or until the first counter-example is generated. The 
final number of remaining unclassified mutants (#?) is given 
in the last column. 

Experimental results. For the 20 considered programs,
928 mutants have been generated. 80 of them have been
proved by Wp. Among the 848 unproven mutants, NCD has
detected a non-compliance induced by the mutation in 776
mutants (91.5%), leaving 72 unclassified. Among them, SWD
has been able to exhibit a counter-example (either an NCCE or
an SWCE) for 48 of them (66.7%), finally leaving 24 programs
unclassified. They can be either equivalent mutants that were
not proved by Wp due to a prover incapacity, or mutants
coming from a mutation in an unsupported annotation being
undetectable by the current version, or incorrect mutants for
which testing was incomplete due to a timeout. Regarding
RQ1, StaDy has found a precise reason of the proof failures
and produced a counter-example in 824 of the 848 unproven
mutants, i.e. classifying 97.2%. Exploring the benefits of
detecting a prover incapacity often requires manually
reducing the input domain, or trying additional lemmas or interactive
proof, so it was not sufficiently investigated in this study (and
would probably require another, non-mutational approach).

Regarding RQ2, NCD alone diagnosed 776 of 848
unproven mutants (91.5%). SWD diagnosed 48 of the 72
remaining mutants (66.7%), bringing a significant complementary
contribution to a better understanding of reasons of many proof
failures.
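The ratios quoted above are each computed with respect to the mutants left unclassified by the previous step. The following sketch recomputes them from the reported counts; the helper names are hypothetical, and ratios are expressed in tenths of a percent to stay in integer arithmetic.

```c
#include <assert.h>

/* Ratio classified/remaining in tenths of a percent, rounded to nearest. */
int ratio_tenths(int classified, int remaining) {
    assert(remaining > 0 && classified >= 0 && classified <= remaining);
    return (1000 * classified + remaining / 2) / remaining;
}

/* Number of mutants still unclassified after a step. */
int left_after(int remaining, int classified) {
    assert(classified >= 0 && classified <= remaining);
    return remaining - classified;
}
```

Starting from 928 mutants, 80 proved ones leave 848; NCD classifies 776 of 848 (915 tenths, i.e. 91.5%) and leaves 72; SWD classifies 48 of 72 (66.7%) and leaves 24; overall 824 of 848 unproven mutants are classified (97.2%).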

In our experiments, each prover can try to prove each 
verification condition during at most 40 seconds. We also 
set a timeout for any test generation session to 5 seconds, 
i.e. one session for the NCD step, and several sessions for 
SWD steps. We also limit the depth of explored program 
paths with the k-path criterion (cf. Sec. ED) setting k = 4. 
Both the session timeout and the k-path heavily limit the 
testing coverage but StaDy still detects 97.2% of faults in 
the generated programs. That addresses RQ3 and demonstrates 
that the proposed method can efficiently classify proof failures 
and generate counter-examples even with a partial testing 
coverage and can therefore be used for programs where the 
total number of paths cannot be limited (e.g. by the typically 
clause). 

Concerning RQ4, on the considered programs Wp needs 
on average 2.6 sec. per mutant (at most 4.4 sec.) to prove a 
program, and spends 13.0 sec. on average (at most 61.3 sec.) 
when the proof fails. The total execution time of StaDy is 
comparable: it needs on average 2.7 sec. per unproven mutant 
(at most 19.9 sec.). 

Summary. The experiments show that the proposed 
method can automatically classify a significant number of 
proof failures within an analysis time comparable to the time of 
an automatic proof and for programs for which only a partial 
testing coverage is possible. The SWD technique offers an 
efficient complement to NCD for a more complete and more 
precise diagnosis of proof failures. 

Threats to validity. As is often the case in software
verification studies, one major threat is related to the
representativeness of results, i.e. their external validity. In our
case, due to the nature of the problem, we are restricted to
realistic annotated programs that cannot be generated
automatically or extracted from existing databases of unspecified
code. Therefore, to reduce this threat, we used programs from
an independent benchmark [14] created in order to illustrate
on different examples the usage of the ACSL specification
language for deductive verification with Frama-C.

Fig. 9: Detailed experiments of proof failure diagnosis for mutants with StaDy


Scalability of the results is another threat since we do not 
demonstrate their validity for functions of larger programs. 
Because of the modular reasoning of deductive verification, 
it can be argued that the proposed technique should only be 
applied on a unit level, separately for each function, since the 
verification engineer proves a program in this way. Indeed, 
in the current practice of deductive verification, it does not 
make sense to analyze proof failures for the whole module or 
application at the same time. 

The main scalability concern is thus related to the usage
of structural test generation that can often time out without
achieving a full coverage. To address this issue, we have
specifically investigated the impact of a partial test coverage on
the effectiveness of the method (cf. RQ3 above) and proposed
a convenient way to reduce the input domain (using the typically
clause, an extension of ACSL).

Other threats can be due to the used measurements, i.e.
construct validity. To reduce this threat, we used a careful
measurement of results (including analysis time for each step
and each mutant, their mean and maximal values, separately
computed for classified and unclassified proof failures). One
concern is producing realistic situations in which the
verification engineer may need help in the analysis of proof
failures. While the first users of StaDy have appreciated its
feedback, we have not yet had the opportunity to organize a fair
evaluation with a representative group of users. Thus we have
performed an extended set of experiments using simulation of
errors by mutations as an alternative in the meantime. We
have chosen a large subset of mutation operators (mutation in
the code, mutation in an annotation, deletion of an annotation)
that model frequent problematic situations (incorrect code or
annotations, incomplete specification) leading to proof failures.
This approach looks suitable for non-compliance and
subcontract weaknesses, and certainly less suitable for the more subtle
prover incapacity cases. The results should be later confirmed
by a representative user study.

VII. Related Work 

Understanding proof failures. A two-step verification
in [15] compares the proof failures of an Eiffel program with
those of its variant where called functions are inlined and loops
are unrolled. It reports code and contract revision suggestions
from this comparison. Inlining and unrolling are respectively
limited to a given number of nested calls and explicit iterations.
If that number is too small the semantics is lost and a warning
of unsoundness is also reported to the user.

Proof tree analysis. More precision can be statically
obtained by analyzing the unclosed branches of a proof tree.
The work [16] is performed in the context of KeY and its
verification calculus that applies deduction rules to a dynamic
formula mixing a program and its specification. It proposes
falsifiability preservation checking that helps to distinguish
whether the branch failure comes from a programming error or
from a contract weakness. However, this technique can detect
bugs only if contracts are strong enough. Moreover, it is
automatic only if a prover (typically, an SMT solver) can decide
the non-satisfiability of the first-order formula expressing the
falsifiability preservation condition. The work [17] exploits the proof
trees generated during a proof attempt by KeY. The relevance
of generated tests depends on the quality of the specification
written by the user, and it does not make it possible to distinguish
non-compliances from specification weaknesses.

Combination of static and dynamic analysis. Static
and dynamic analysis work better when used together, as in
the method Synergy [18], its interprocedural and
compositional extension in Smash [19], the method Sante [20]
and the present method. Static analysis maintains an
over-approximation that aims at verifying the correctness of the
system, while dynamic analysis maintains an under-approximation
trying to detect an error. Both abstractions help each other
in a way similar to the counter-example guided abstraction
refinement method (CEGAR) [21].

Counter-examples for non-inductive invariants.
Counter-examples can be generated to show that invariants
proposed for transition systems are too strong or too
weak [22]. Differences with our work are the focus on
invariants, the formalism of transition systems, and the use of
random testing (with QuickCheck).

Other verification feedback. Our goal was to find input
data to illustrate proof failures. A complementary work [13]
proposed to extend a runtime assertion checker to use it as
a debugger to help the user understand complex proof failure
counter-examples. The Dafny development environment [3]
provides verification feedback to the user during the
programming phase. It integrates the Boogie Verification Debugger
[23] that helps the understanding of verification tools like
Boogie. Currently, Dafny only uses counter-examples
provided by the solver, and does not produce as much information
when verification times out as it does when verification fails.

Checking prover assumptions. Axioms are logic
properties used as hypotheses by provers and thus usually not
checked. Model-based testing applied to a computational
model of an axiom makes it possible to detect errors in axioms and
thus to maintain the soundness of the axiomatization [24]. This
work is complementary to ours because it tackles the case of
deductive verification trivially succeeding due to an invalid
axiomatization, whereas we tackle the case of inconclusive
deductive verification. The work [25] proposed to complement the results
of static checkers with dynamic symbolic execution using
Pex. The explicit assumptions used by the verifier (absence
of overflows, non-aliasing, etc.) create new branches in the
program's control flow graph which Pex tries to explore.
This approach makes it possible to detect errors out of the scope of
the considered static checkers, but does not provide
counter-examples in case of a specification weakness.

The present work continues the previous efforts to
facilitate deductive verification by generating counter-examples. We
propose an original detection technique of three categories of
proof failure that gives a more precise diagnostic than in the
previous work using testing. Thanks to the separate detection
of non-compliances and single subcontract weaknesses, the
generated counter-examples can better identify the reasons of
proof failures than those extracted from a solver. To the best
of our knowledge, such a complete testing-based methodology,
automatically providing the verification engineer with
precise feedback on proof failures, has not been
studied, implemented and evaluated before.

VIII. Conclusion and Future Work 

We proposed a new approach to improve the user feedback
in case of a proof failure. Our method relies on test generation
and helps to decide whether the proof has failed or timed out
due to a non-compliance between the code and the
specification, a subcontract weakness, or a prover incapacity. This
approach is based on a spec-to-code program transformation
that makes it possible to use a test generator taking a C program as input.
The transformation for SWD is an original contribution of this
paper. Our experiments show that our implementation, the
Frama-C plugin StaDy, was able to diagnose over 97% of
the programs (generated by introducing a mutation in a verified
program).

One benefit of the proposed approach is the capacity to 
provide the verification engineer with a precise reason of a 
proof failure that helps to choose the right way to proceed and 
facilitates the processing of proof failures. Counter-examples 
illustrate the issue on concrete values and help to find out 
more easily why the proof fails. The method is completely 
automatic, relies on the existing specification and does not 
require any additional manual specification or instrumentation 
task. As a consequence, this method can be adopted by less 
experienced verification engineers and software developers. 

One requirement of the complete method coming from
test generation is to have the C code of called functions,
while the GSWD technique remains applicable even without
source code. Another limitation is related to a potentially
very large number of program paths that cannot all be explored.
Initial experiments show that proof failures can be classified in
practice even after test generation with a partial test coverage,
within a testing time comparable to the time of the proof.

We are convinced that the proposed methodology facilitates
the verification task by lowering the level of expertise required
to conduct a deductive program proof, removing one of the
major obstacles for a wider use of deductive verification in
industry. Future work includes further evaluation of the proposed
methodology, a study of optimized combinations of NCD and
SWD for subsets of annotations and subcontracts, experiments
on a larger class of programs and a better support of E-ACSL
constructs in our implementation (inductive predicates, validity
of non-input pointers).

Fig. 10: (a) An annotated code, vs. (b) its translation for NCD


References

[1] E. W. Dijkstra, A Discipline of Programming, ser. Series in Automatic Computation. Englewood Cliffs: Prentice Hall, 1976.

[2] The Coq Development Team, “The Coq Proof Assistant,” http://coq.inria.fr

[3] K. R. M. Leino and V. Wüstholz, “The Dafny integrated development environment,” in F-IDE, 2014.

[4] G. Petiot, N. Kosmatov, A. Giorgetti, and J. Julliand, “How test generation helps software specification and deductive verification in Frama-C,” in TAP, 2014.

[5] G. Petiot, B. Botella, J. Julliand, N. Kosmatov, and J. Signoles, “Instrumentation of annotated C programs for test generation,” in SCAM, 2014.

[6] P. Cuoq, F. Kirchner, N. Kosmatov, V. Prevosto, J. Signoles, and B. Yakobowski, “Frama-C - a software analysis perspective,” in SEFM, 2012.

[7] P. Baudin, P. Cuoq, J.-C. Filliâtre, C. Marché, B. Monate, Y. Moy, and V. Prevosto, ACSL: ANSI/ISO C Specification Language, URL: http://frama-c.com/acsl.html.

[8] M. Delahaye, N. Kosmatov, and J. Signoles, “Common specification language for static and dynamic analysis of C programs,” in SAC, 2013.

[9] J. Signoles, E-ACSL: Executable ANSI/ISO C Specification Language.

[10] B. Botella, M. Delahaye, S. Hong Tuan Ha, N. Kosmatov, P. Mouy, M. Roger, and N. Williams, “Automating structural testing of C programs: Experience with PathCrawler,” in AST, 2009.

[11] R. Genestier, A. Giorgetti, and G. Petiot, “Sequential generation of structured arrays and its deductive verification,” in TAP, 2015.

[12] J. Arndt, Matters Computational - Ideas, Algorithms, Source Code [The fxtbook], 2010, published electronically at http://www.jjj.de.

[13] P. Müller and J. N. Ruskiewicz, “Using debuggers to understand failed verification attempts,” in FM, 2011.

[14] J. Burghardt and J. Gerlach, ACSL by Example, 2015, published electronically at http://www.fokus.fraunhofer.de/download/acsl_by_example.

[15] J. Tschannen, C. A. Furia, M. Nordio, and B. Meyer, “Program checking with less hassle,” in Verified Software: Theories, Tools, Experiments, 2014.

[16] C. Gladisch, “Could we have chosen a better loop invariant or method contract?” in TAP, 2009.

[17] C. Engel and R. Hähnle, “Generating unit tests from formal proofs,” in TAP, 2007.

[18] B. S. Gulavani, T. A. Henzinger, Y. Kannan, A. V. Nori, and S. K. Rajamani, “SYNERGY: A new algorithm for property checking,” in FSE, 2006.

[19] P. Godefroid, A. V. Nori, S. K. Rajamani, and S. D. Tetali, “Compositional may-must program analysis: unleashing the power of alternation,” in POPL, 2010.

[20] O. Chebaro, N. Kosmatov, A. Giorgetti, and J. Julliand, “Program slicing enhances a verification technique combining static and dynamic analysis,” in SAC, 2012.

[21] E. Clarke, O. Grumberg, S. Jha, Y. Lu, and H. Veith, “Counterexample-guided abstraction refinement for symbolic model checking,” J. ACM, 2003.

[22] K. Claessen and H. Svensson, “Finding counter examples in induction proofs,” in TAP, 2008.

[23] C. Le Goues, K. R. M. Leino, and M. Moskal, “The Boogie verification debugger,” in SEFM, 2011.

[24] K. Y. Ahn and E. Denney, “Testing first-order logic axioms in program verification,” in TAP, 2010.

[25] M. Christakis, P. Müller, and V. Wüstholz, “Collaborative verification and testing with explicit assumptions,” in FM, 2012.


Appendix 

Program transformation for non-compliance detection.
Fig. 10 illustrates the translation of an annotated program P
into another C program, P^NC, that is used to generate
counter-examples during non-compliance detection (NCD).


